{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answer: <think>\n",
      "Okay, so I need to figure out what the main topic of the given documents is. Let me start by reading through the context carefully.\n",
      "\n",
      "The context begins by introducing Desaiv, an insurtech company focused on health insurance management using AI-powered solutions. It was founded in 2022 and operates mainly in the MENA region. They secured a pre-seed funding of $2 million from several investors including 500 Global, Terra VC, Oqal angel investors, and others from the MENA, UK, and USA.\n",
      "\n",
      "The mission is to revolutionize the insurance sector with AI, achieving over 95% prediction accuracy using data from top-tier universities like Oxford. The platform offers various features such as effortless data management, advanced OCR technology, AI-powered analytics, future cost predictions, and customization options. It also lists specific features like Claims Experience Analytics, Quotation Comparison, Request New Quotation, Provider Comparison, and an Underwriter Tool.\n",
      "\n",
      "The required files for using the platform's features are mentioned, including Claims Experience, Active List, Raw Data (optional), and Providers List. There's a contact section with phone number and email.\n",
      "\n",
      "So, putting this together, the documents primarily talk about Desaiv, its operations in the MENA region, their funding, mission, AI capabilities, platform features, required data inputs, and how to get in touch.\n",
      "\n",
      "The main topic would be Desaiv as an insurtech company using AI for health insurance management. The documents cover their services, technology, market presence, funding, and operational aspects.\n",
      "</think>\n",
      "\n",
      "The main topic of the documents is Desaiv, an insurtech company that specializes in managing health insurance through AI-powered solutions. It operates in the MENA region, has secured pre-seed funding from notable investors, and aims to revolutionize the insurance sector with advanced analytics and automation tools. The platform offers features such as claims analytics, quotation comparison, provider networks, and underwriting tools, utilizing data inputs like claims experience and providers lists.\n",
      "\n",
      "Answer: Desaiv is an insurtech company focused on health insurance management using AI solutions, operating in the MENA region with a focus on innovation through advanced analytics and automation.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import PyPDF2\n",
    "import faiss\n",
    "from transformers import AutoTokenizer, AutoModel\n",
    "import ollama\n",
    "\n",
    "# Step 1: Extract text from PDFs\n",
    "def extract_text_from_pdf(pdf_path):\n",
    "    with open(pdf_path, \"rb\") as f:\n",
    "        reader = PyPDF2.PdfReader(f)\n",
    "        text = \"\"\n",
    "        for page in reader.pages:\n",
    "            text += page.extract_text()\n",
    "        return text\n",
    "\n",
    "# Read all PDFs from the 'docs' directory\n",
    "pdf_dir = \"docs\"\n",
    "pdf_files = [os.path.join(pdf_dir, file) for file in os.listdir(pdf_dir) if file.endswith(\".pdf\")]\n",
    "\n",
    "# Extract text from all PDFs\n",
    "documents = [extract_text_from_pdf(pdf) for pdf in pdf_files]\n",
    "\n",
    "# Step 2: Generate embeddings for documents\n",
    "# Use a pre-trained model for embeddings\n",
    "embedding_model_name = \"sentence-transformers/all-MiniLM-L6-v2\"  # Change if needed\n",
    "tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)\n",
    "model = AutoModel.from_pretrained(embedding_model_name)\n",
    "\n",
    "def embed_text(text):\n",
    "    inputs = tokenizer(text, return_tensors=\"pt\", truncation=True, padding=True)\n",
    "    outputs = model(**inputs)\n",
    "    return outputs.last_hidden_state.mean(dim=1).detach().numpy()\n",
    "\n",
    "# Generate embeddings for each document\n",
    "embeddings = [embed_text(doc) for doc in documents]\n",
    "\n",
    "# Step 3: Create a Faiss index\n",
    "dimension = embeddings[0].shape[1]  # Embedding dimension\n",
    "index = faiss.IndexFlatL2(dimension)\n",
    "\n",
    "# Add document embeddings to the Faiss index\n",
    "for emb in embeddings:\n",
    "    index.add(emb)\n",
    "\n",
    "# Step 4: Use DeepSeek model in Ollama\n",
    "def query_with_ollama(question, retrieved_text):\n",
    "    client = ollama.Client()\n",
    "    response = client.generate(\n",
    "        model=\"deepseek-r1:7b\",  # Change if using a different model\n",
    "        system=\"You are a helpful assistant answering questions using retrieved context.\",\n",
    "        prompt=f\"Context: {retrieved_text}\\n\\nQuestion: {question}\\nAnswer:\"\n",
    "    )\n",
    "\n",
    "    return answer.get(\"response\", \"\")\n",
    "\n",
    "# Step 5: Query the pipeline\n",
    "def query_pipeline(question, top_k=1):\n",
    "    # Convert question to an embedding\n",
    "    question_embedding = embed_text(question)\n",
    "\n",
    "    # Search the Faiss index\n",
    "    distances, indices = index.search(question_embedding, top_k)\n",
    "\n",
    "    # Retrieve the most relevant document(s)\n",
    "    retrieved_text = \"\\n\".join([documents[i] for i in indices[0]])\n",
    "\n",
    "    # Generate an answer using DeepSeek in Ollama\n",
    "    return query_with_ollama(question, retrieved_text)\n",
    "\n",
    "# Example usage\n",
    "if __name__ == \"__main__\":\n",
    "    question = \"What is the main topic of the documents?\"\n",
    "    answer = query_pipeline(question)\n",
    "    print(\"Answer:\", answer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"<think>\\nOkay, so I need to figure out what the main topic of the given documents is. Let me start by reading through the context carefully.\\n\\nThe context begins by introducing Desaiv, an insurtech company focused on health insurance management using AI-powered solutions. It was founded in 2022 and operates mainly in the MENA region. They secured a pre-seed funding of $2 million from several investors including 500 Global, Terra VC, Oqal angel investors, and others from the MENA, UK, and USA.\\n\\nThe mission is to revolutionize the insurance sector with AI, achieving over 95% prediction accuracy using data from top-tier universities like Oxford. The platform offers various features such as effortless data management, advanced OCR technology, AI-powered analytics, future cost predictions, and customization options. It also lists specific features like Claims Experience Analytics, Quotation Comparison, Request New Quotation, Provider Comparison, and an Underwriter Tool.\\n\\nThe required files for using the platform's features are mentioned, including Claims Experience, Active List, Raw Data (optional), and Providers List. There's a contact section with phone number and email.\\n\\nSo, putting this together, the documents primarily talk about Desaiv, its operations in the MENA region, their funding, mission, AI capabilities, platform features, required data inputs, and how to get in touch.\\n\\nThe main topic would be Desaiv as an insurtech company using AI for health insurance management. The documents cover their services, technology, market presence, funding, and operational aspects.\\n</think>\\n\\nThe main topic of the documents is Desaiv, an insurtech company that specializes in managing health insurance through AI-powered solutions. It operates in the MENA region, has secured pre-seed funding from notable investors, and aims to revolutionize the insurance sector with advanced analytics and automation tools. The platform offers features such as claims analytics, quotation comparison, provider networks, and underwriting tools, utilizing data inputs like claims experience and providers lists.\\n\\nAnswer: Desaiv is an insurtech company focused on health insurance management using AI solutions, operating in the MENA region with a focus on innovation through advanced analytics and automation.\""
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1, 384)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings[0].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
